Abstract. We present the nested Chinese restaurant process (nCRP),
a stochastic process which assigns probability distributions to
infinitely-deep, infinitely-branching trees. We show how this stochastic process
can be used as a prior distribution in a Bayesian nonparametric model of 
document collections. Specifically, we present an application to
information retrieval in which documents are modeled as paths down a
random tree, and the preferential attachment dynamics of the nCRP lead to
clustering of documents according to sharing of topics at multiple levels 
of abstraction. Given a corpus of documents, a posterior inference algo- 
rithm finds an approximation to a posterior distribution over trees, topics 
and allocations of words to levels of the tree. We demonstrate this al- 
gorithm on collections of scientific abstracts from several journals. This 
model exemplifies a recent trend in statistical machine learning — the use 
of Bayesian nonparametric methods to infer distributions on flexible data 
structures. 



1. Introduction

For much of its history, computer science has focused on deductive for- 
mal methods, allying itself with deductive traditions in areas of mathematics 
such as set theory, logic, algebra, and combinatorics. There has been ac- 
cordingly less focus on efforts to develop inductive, empirically-based for- 
malisms in computer science, a gap which became increasingly visible over 
the years as computers have been required to interact with noisy, difficult- 
to-characterize sources of data, such as those deriving from physical signals 
or from human activity. In more recent history, the field of machine learning 
has aimed to fill this gap, allying itself with inductive traditions in
probability and statistics, while focusing on methods that are amenable to analysis
as computational procedures. 

Machine learning methods can be divided into supervised learning meth- 
ods and unsupervised learning methods. Supervised learning has been a 
major focus of machine learning research. In supervised learning, each data 
point is associated with a label (e.g., a category, a rank or a real number) 
and the goal is to find a function that maps data into labels (so as to predict 
the labels of data that have not yet been labeled). A canonical example of 
supervised machine learning is the email spam filter, which is trained on 
known spam messages and then used to mark incoming unlabeled email as 
spam or non-spam. 

While supervised learning remains an active and vibrant area of research, 
more recently the focus in machine learning has turned to unsupervised 
learning methods. In unsupervised learning the data are not labeled, and the 
broad goal is to find patterns and structure within the data set. Different for- 
mulations of unsupervised learning are based on different notions of "pat- 
tern" and "structure." Canonical examples include clustering, the problem 
of grouping data into meaningful groups of similar points, and dimension 
reduction, the problem of finding a compact representation that retains use- 
ful information in the data set. One way to render these notions concrete is 
to tie them to a supervised learning problem; thus, a structure is validated 
if it aids the performance of an associated supervised learning system. Of- 
ten, however, the goal is more exploratory. Inferred structures and patterns 
might be used, for example, to visualize or organize the data according 
to subjective criteria. With the increased access to all kinds of unlabeled 
data — scientific data, personal data, consumer data, economic data, govern- 
ment data, text data — exploratory unsupervised machine learning methods 
have become increasingly prominent. 

Another important dichotomy in machine learning distinguishes between 
parametric and nonparametric models. A parametric model involves a fixed 
representation that does not grow structurally as more data are observed. 
Examples include linear regression and clustering methods in which the 
number of clusters is fixed a priori. A nonparametric model, on the other 
hand, is based on representations that are allowed to grow structurally as 
more data are observed.¹ Nonparametric approaches are often adopted when
the goal is to impose as few assumptions as possible and to "let the data 
speak." 



¹ In particular, despite the nomenclature, a nonparametric model can involve parameters;
the issue is whether or not the number of parameters grows as more data are observed.

The nonparametric approach underlies many of the most significant
developments in the supervised learning branch of machine learning over the
past two decades. In particular, modern classifiers such as decision trees, 
boosting and nearest neighbor methods are nonparametric, as is the class
of supervised learning systems built on "kernel methods," including the
support vector machine. (See Hastie et al. (2001) for a good review of these
methods.) Theoretical developments in supervised learning have shown that
as the number of data points grows, these methods can converge to the true
labeling function underlying the data, even when the data lie in an
uncountably infinite space and the labeling function is arbitrary
(Devroye et al., 1996). This would clearly not be possible for parametric classifiers.

The assumption that labels are available in supervised learning is a strong 
assumption, but it has the virtue that few additional assumptions are gener- 
ally needed to obtain a useful supervised learning methodology. In unsu- 
pervised learning, on the other hand, the absence of labels and the need to 
obtain operational definitions of "pattern" and "structure" generally makes 
it necessary to impose additional assumptions on the data source. In par- 
ticular, unsupervised learning methods are often based on "generative mod- 
els," which are probabilistic models that express hypotheses about the way 
in which the data may have been generated. Probabilistic graphical mod- 
els (also known as "Bayesian networks" and "Markov random fields") have 
emerged as a broadly useful approach to specifying generative models
(Lauritzen, 1996; Jordan, 2000). The elegant marriage of graph theory and
probability theory in graphical models makes it possible to take a fully
probabilistic (i.e., Bayesian) approach to unsupervised learning in which efficient
algorithms are available to update a prior generative model into a posterior 
generative model once data have been observed. 

Although graphical models have catalyzed much research in unsuper- 
vised learning and have had many practical successes, it is important to note 
that most of the graphical model literature has been focused on parametric 
models. In particular, the graphs and the local potential functions compris- 
ing a graphical model are viewed as fixed objects; they do not grow struc- 
turally as more data are observed. Thus, while nonparametric methods have 
dominated the literature in supervised learning, parametric methods have 
dominated in unsupervised learning. This may seem surprising given that 
the open-ended nature of the unsupervised learning problem seems partic- 
ularly commensurate with the nonparametric philosophy. But it reflects an 
underlying tension in unsupervised learning — to obtain a well-posed learn- 
ing problem it is necessary to impose assumptions, but the assumptions 
should not be too strong or they will inform the discovered structure more 
than the data themselves. 
It is our view that the framework of Bayesian nonparametric statistics
provides a general way to lessen this tension and to pave the way to un- 
supervised learning methods that combine the virtues of the probabilistic 
approach embodied in graphical models with the nonparametric spirit of 
supervised learning. In Bayesian nonparametric (BNP) inference, the prior 
and posterior distributions are no longer restricted to be parametric
distributions, but are general stochastic processes (Hjort et al., 2009). Recall
that a stochastic process is simply an indexed collection of random vari- 
ables, where the index set is allowed to be infinite. Thus, using stochas- 
tic processes, the objects of Bayesian inference are no longer restricted to 
finite-dimensional spaces, but are allowed to range over general infinite- 
dimensional spaces. For example, objects such as trees of arbitrary branch- 
ing factor and arbitrary depth are allowed within the BNP framework, as 
are other structured objects of open-ended cardinality such as partitions and 
lists. It is also possible to work with stochastic processes that place distribu- 
tions on functions and distributions on distributions. The latter fact exhibits 
the potential for recursive constructions that is available within the BNP 
framework. In general, we view the representational flexibility of the BNP 
framework as a statistical counterpart of the flexible data structures that are 
ubiquitous in computer science. 

In this paper, we aim to introduce the BNP framework to a wider com- 
putational audience by showing how BNP methods can be deployed in 
a specific unsupervised machine learning problem of significant current 
interest — that of learning topic models for collections of text, images and 
other semi-structured corpora (Blei et al., 2003; Griffiths and Steyvers, 2006;
Blei and Lafferty, 2009).



Let us briefly introduce the problem here; a more formal presentation
appears in Section 4. A topic is defined to be a probability distribution across
words from a vocabulary. Given an input corpus — a set of documents each 
consisting of a sequence of words — we want an algorithm to both find use- 
ful sets of topics and learn to organize the topics according to a hierarchy 
in which more abstract topics are near the root of the hierarchy and more 
concrete topics are near the leaves. While a classical unsupervised
analysis might require the topology of the hierarchy (branching factors, etc.) to
be chosen in advance, our BNP approach aims to infer a distribution on 
topologies, in particular placing high probability on those hierarchies that 
best explain the data. Moreover, in accordance with our goals of using 
flexible models that "let the data speak," we wish to allow this distribution 
to have its support on arbitrary topologies — there should be no limitations 
such as a maximum depth or maximum branching factor. 

We provide an example of the output from our algorithm in Figure 1. The
input corpus in this case was a collection of abstracts from the Journal of the
ACM (JACM) from the years 1987 to 2004. The figure depicts a topology
that is given highest probability by our algorithm, along with the highest
probability words from the topics associated with this topology (each node
in the tree corresponds to a single topic). As can be seen from the figure,
the algorithm has discovered the category of function words at level zero
(e.g., "the" and "of"), and has discovered a set of first-level topics that are
a reasonably faithful representation of some of the main areas of computer
science. The second level provides a further subdivision into more concrete
topics. We emphasize that this is an unsupervised problem. The algorithm
discovers the topic hierarchy without any extra information about the corpus
(e.g., keywords, titles or authors). The documents are the only inputs to the
algorithm.

[Figure 1. Example documents associated with paths in the hierarchy:]

Property testing and its connection to learning and approximation
Fully dynamic planarity testing with applications
Recognizing planar perfect graphs
The coloring and maximum independent set problems on planar perfect graphs

Learning to reason
Learning Boolean formulas
Learning functions represented as multiplicity automata
Dense quantum coding and quantum finite automata
A neuroidal architecture for cognitive computation

Alternating-time temporal logic
Fixpoint logics, relational machines, and computational complexity
Definable relations and first-order query languages over strings
Autoepistemic logic
Expressiveness of structured document query languages...

Planar-adaptive routing: low-cost adaptive networks for multiprocessors
On-line analysis of the TCP acknowledgment delay problem
A trade-off between space and efficiency for routing tables
Universal-stability results and performance bounds for greedy contention-resolution protocols
Periodification scheme: constructing sorting networks with constant period

Figure 1. The topic hierarchy learned from 536 abstracts
of the Journal of the ACM (JACM) from 1987-2004. The
vocabulary was restricted to the 1,539 terms that occurred in
more than five documents, yielding a corpus of 68K words.
The learned hierarchy contains 25 topics, and each topic
node is annotated with its top five most probable terms. We
also present examples of documents associated with a subset
of the paths in the hierarchy.

A learned topic hierarchy can be useful for many tasks, including text 
categorization, text compression, text summarization and language model- 
ing for speech recognition. A commonly-used surrogate for the evaluation 
of performance in these tasks is predictive likelihood, and we use predic- 
tive likelihood to evaluate our methods quantitatively. But we also view our 
work as making a contribution to the development of methods for the visu- 
alization and browsing of documents. The model and algorithm we describe 
can be used to build a topic hierarchy for a document collection, and that 
hierarchy can be used to sharpen a user's understanding of the contents of 
the collection. A qualitative measure of the success of our approach is that 
the same tool should be able to uncover a useful topic hierarchy in different 
domains based solely on the input data. 

By defining a probabilistic model for documents, we do not define the 
level of "abstraction" of a topic formally, but rather define a statistical pro- 
cedure that allows a system designer to capture notions of abstraction that 
are reflected in usage patterns of the specific corpus at hand. While the 
content of topics will vary across corpora, the ways in which abstraction 
interacts with usage will not. A corpus might be a collection of images, 
a collection of HTML documents or a collection of DNA sequences. Dif- 
ferent notions of abstraction will be appropriate in these different domains, 
but each is expressed and discoverable in the data, making it possible to
automatically construct a hierarchy of topics. 

This paper is organized as follows. We begin with a review of the
necessary background in stochastic processes and Bayesian nonparametric
statistics in Section 2. In Section 3 we develop the nested Chinese restaurant
process, the prior on topologies that we use in the hierarchical topic model
of Section 4. We derive an approximate posterior inference algorithm in
Section 5 to learn topic hierarchies from text data. Examples and an
empirical evaluation are provided in Section 6. Finally, we present related work
and a discussion in Section 7.

Figure 2. A configuration of the Chinese restaurant process.
There are an infinite number of tables, each associated
with a parameter $\beta_i$. The customers sit at the tables
according to Eq. (1) and each generate data with the corresponding
parameter. In this configuration, ten customers have been
seated in the restaurant, populating four of the infinite set of
tables.



2. Background 

Our approach to topic modeling reposes on several building blocks from 
stochastic process theory and Bayesian nonparametric statistics,
specifically the Chinese restaurant process (Aldous, 1985), stick-breaking
processes (Pitman, 2002), and the Dirichlet process mixture (Antoniak, 1974).
In this section we briefly review these ideas and the connections between 
them. 



2.1. Dirichlet and beta distributions. Recall that the Dirichlet distribu- 
tion is a probability distribution on the simplex of nonnegative real numbers 
that sum to one. We write

$$U \sim \mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K),$$

for a random vector $U$ distributed as a Dirichlet random variable on the
$K$-simplex, where $\alpha_i > 0$ are parameters. The mean of $U$ is proportional to
the parameters

$$\mathrm{E}[U_i] = \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k},$$

and the magnitude of the parameters determines the concentration of $U$
around the mean. The specific choice $\alpha_1 = \cdots = \alpha_K = 1$ yields the
uniform distribution on the simplex. Letting $\alpha_i > 1$ yields a unimodal
distribution peaked around the mean, and letting $\alpha_i < 1$ yields a distribution
that has modes at the corners of the simplex. The beta distribution is a
special case of the Dirichlet distribution for $K = 2$, in which case the simplex
is the unit interval (0, 1). In this case we write $U \sim \mathrm{Beta}(\alpha_1, \alpha_2)$, where $U$
is a scalar.
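
As an illustration of the mean property described above, the following minimal sketch draws Dirichlet samples by normalizing independent gamma variates (a standard construction, not one given in this paper) and compares the empirical mean with the normalized parameters. The function names `sample_dirichlet` and `dirichlet_mean` and the parameter values are hypothetical choices made for this example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def dirichlet_mean(alphas):
    """Closed-form mean: each parameter divided by the sum of all parameters."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Illustrative parameters (an assumption of this sketch, not values from the paper).
rng = random.Random(0)
alphas = [2.0, 1.0, 1.0]
draws = [sample_dirichlet(alphas, rng) for _ in range(20000)]
empirical = [sum(d[i] for d in draws) / len(draws) for i in range(len(alphas))]

print(dirichlet_mean(alphas))  # [0.5, 0.25, 0.25]
print(empirical)               # should be close to the line above
```

Scaling all parameters by a common factor leaves the mean unchanged and only alters how tightly the draws concentrate around it.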
2.2. Chinese restaurant process. The Chinese restaurant process (CRP)
is a single parameter distribution over partitions of the integers. The
distribution can be most easily described by specifying how to draw a sample
from it. Consider a restaurant with an infinite number of tables each with
infinite capacity. A sequence of $N$ customers arrive, labeled with the
integers $\{1, \ldots, N\}$. The first customer sits at the first table; the $n$th subsequent
customer sits at a table drawn from the following distribution:

$$\begin{aligned}
p(\text{occupied table } i \mid \text{previous customers}) &= \frac{n_i}{\gamma + n - 1}, \\
p(\text{next unoccupied table} \mid \text{previous customers}) &= \frac{\gamma}{\gamma + n - 1},
\end{aligned} \tag{1}$$

where $n_i$ is the number of customers currently sitting at table $i$, and $\gamma$ is a
real-valued parameter which controls how often, relative to the number of
customers in the restaurant, a customer chooses a new table versus sitting
with others. After $N$ customers have been seated, the seating plan gives a
partition of those customers as illustrated in Figure 2.
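
The seating rule above can be sketched as a short sampler. This is a minimal illustration under stated assumptions, not the authors' code: the name `sample_crp` is hypothetical, tables are indexed from zero in order of first occupancy, and the concentration parameter is assumed to be strictly positive.

```python
import random

def sample_crp(num_customers, gamma, rng):
    """Seat customers one at a time; return table assignments and table counts."""
    assignments = []  # assignments[j] = table index of the (j+1)-th customer
    counts = []       # counts[i] = number of customers currently at table i
    for _ in range(num_customers):
        # Occupied table i has weight n_i; the next unoccupied table has weight gamma.
        # rng.choices normalizes the weights, which supplies the common denominator.
        weights = counts + [gamma]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an occupied table
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(10, 1.0, random.Random(0))
print(assignments)  # the induced partition: customers with equal labels share a table
print(counts)
```

Because the weight of an occupied table is its current count, popular tables attract further customers, which is the preferential attachment behavior the text describes.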

With an eye towards Bayesian statistical applications, we assume that
each table is endowed with a parameter vector $\beta$ drawn from a distribution
$G_0$. Each customer is associated with the parameter vector at the table at
which he sits. The resulting distribution on sequences of parameter values
is referred to as a Pólya urn model (Johnson and Kotz, 1977).

The Pólya urn distribution can be used to define a flexible clustering
model. Let the parameters at the tables index a family of probability distri- 
butions (for example, the distribution might be a multivariate Gaussian in 
which case the parameter would be a mean vector and covariance matrix). 
Associate customers to data points, and draw each data point from the prob- 
ability distribution associated with the table at which the customer sits. This 
induces a probabilistic clustering of the generated data because customers 
sitting around each table share the same parameter vector. 



This model is in the spirit of a traditional mixture model (Titterington
et al., 1985), but is critically different in that the number of tables is
unbounded. Data analysis amounts to inverting the generative process to
determine a probability distribution on the "seating assignment" of a data set.
The underlying CRP lets the data determine the number of clusters (i.e., the 
number of occupied tables) and further allows new data to be assigned to 
new clusters (i.e., new tables). 

2.3. Stick-breaking constructions. The Dirichlet distribution places a dis- 
tribution on nonnegative K-dimensional vectors whose components sum to 
one. In this section we discuss a stochastic process that allows K to be 
unbounded. 

Consider a collection of nonnegative real numbers $\{\theta_i\}_{i=1}^{\infty}$ where
$\sum_i \theta_i = 1$. We wish to place a probability distribution on such sequences. Given
that each such sequence can be viewed as a probability distribution on the
positive integers, we obtain a distribution on distributions, i.e., a random
probability distribution.

To do this, we use a stick-breaking construction. View the interval (0, 1)
as a unit-length stick. Draw a value $V_1$ from a $\mathrm{Beta}(\alpha_1, \alpha_2)$ distribution and
break off a fraction $V_1$ of the stick. Let $\theta_1 = V_1$ denote this first fragment
of the stick and let $1 - \theta_1$ denote the remainder of the stick. Continue this
procedure recursively, letting $\theta_2 = V_2 (1 - \theta_1)$, and in general define

$$\theta_i = V_i \prod_{j=1}^{i-1} (1 - V_j),$$

where $\{V_i\}$ are an infinite sequence of independent draws from the $\mathrm{Beta}(\alpha_1, \alpha_2)$
distribution. Sethuraman (1994) shows that the resulting sequence $\{\theta_i\}$
satisfies $\sum_i \theta_i = 1$ with probability one.
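
The recursion above can be sketched in a few lines. Since an infinite sequence cannot be materialized, this illustration stops after a fixed number of breaks and reports the unbroken remainder separately; that truncation, the function name `stick_breaking`, and the parameter values are assumptions of the example.

```python
import random

def stick_breaking(alpha1, alpha2, num_breaks, rng):
    """Return the first num_breaks stick lengths and the length of the unbroken remainder."""
    thetas = []
    remaining = 1.0  # product of (1 - V_j) over the breaks made so far
    for _ in range(num_breaks):
        v = rng.betavariate(alpha1, alpha2)
        thetas.append(v * remaining)  # break off a fraction v of what is left
        remaining *= 1.0 - v
    return thetas, remaining

thetas, remaining = stick_breaking(1.0, 2.0, 20, random.Random(0))
print(thetas[:5])
print(sum(thetas) + remaining)  # equals 1 up to floating-point error
```

The remainder shrinks geometrically in expectation, which gives an informal sense of why the full sequence sums to one.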

In the special case $\alpha_1 = 1$ we obtain a one-parameter stochastic
process known as the GEM distribution (Pitman, 2002). Let $\gamma = \alpha_2$ denote
this parameter and denote draws from this distribution as $\theta \sim \mathrm{GEM}(\gamma)$.
Large values of $\gamma$ skew the beta distribution towards zero and yield
random sequences that are heavy-tailed, i.e., significant probability tends to be
assigned to large integers. Small values of $\gamma$ yield random sequences that
decay more quickly to zero.
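
The effect of the parameter on the tail can be made concrete with a short worked calculation. It uses only the stick-breaking definition, the independence of the beta draws, and the standard mean of a Beta(1, γ) variable; the derivation is added here as an illustration and is not part of the original text.

```latex
\mathrm{E}[V_i] = \frac{1}{1+\gamma},
\qquad
\mathrm{E}[\theta_i]
  = \mathrm{E}[V_i] \prod_{j=1}^{i-1} \mathrm{E}[1 - V_j]
  = \frac{1}{1+\gamma} \left( \frac{\gamma}{1+\gamma} \right)^{i-1}.
```

The expected weights decay geometrically with ratio γ/(1+γ). As γ grows the ratio approaches one and mass spreads to large indices; as γ shrinks the ratio approaches zero and the first few components dominate.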

2.4. Connections. The GEM distribution and the CRP are closely related. 
Let $\theta \sim \mathrm{GEM}(\gamma)$ and let $\{Z_1, Z_2, \ldots, Z_N\}$ be a sequence of indicator
variables drawn independently from $\theta$, i.e.,

$$p(Z_n = i \mid \theta) = \theta_i.$$

This distribution on indicator variables induces a random partition on the
integers $\{1, 2, \ldots, N\}$, where the partition reflects indicators that share the
same values. It can be shown that this distribution on partitions is the same
as the distribution on partitions induced by the CRP (Pitman, 2002). As
implied by this result, the GEM parameter $\gamma$ controls the partition in the
same way as the CRP parameter $\gamma$.

As with the CRP, we can augment the GEM distribution to consider draws 
of parameter values. Let $\{\beta_i\}$ be an infinite sequence of independent draws
from a distribution $G_0$ defined on a sample space $\Omega$. Define

$$G = \sum_{i=1}^{\infty} \theta_i \delta_{\beta_i},$$

where $\delta_{\beta_i}$ is an atom at location $\beta_i$ and where $\theta \sim \mathrm{GEM}(\gamma)$. The object $G$
is a distribution on $\Omega$; it is a random distribution.

Consider now a finite partition of $\Omega$. Sethuraman (1994) showed that the
probability assigned by $G$ to the cells of this partition follows a Dirichlet
distribution. Moreover, if we consider all possible finite partitions of $\Omega$,
the resulting Dirichlet distributions are consistent with each other. Thus,
by appealing to the Kolmogorov consistency theorem (Billingsley, 1995),
we can view $G$ as a draw from an underlying stochastic process, where the
index set is the set of Borel sets of $\Omega$. This stochastic process is known as
the Dirichlet process (Ferguson, 1973).

Note that if we truncate the stick-breaking process after $L - 1$ breaks, we
obtain a Dirichlet distribution on an $L$-dimensional vector. The first $L - 1$
components of this vector manifest the same kind of bias towards larger
values for earlier components as the full stick-breaking distribution.
However, the last component $\theta_L$ represents the portion of the stick that remains
after $L - 1$ breaks and has less of a bias toward small values than in the
untruncated case.

Finally, we will find it convenient to define a two-parameter variant of 
the GEM distribution that allows control over both the mean and variance 
of stick lengths. We denote this distribution as $\mathrm{GEM}(m, \pi)$, in which $\pi > 0$
and $m \in (0, 1)$. In this variant, the stick lengths are defined as
$V_i \sim \mathrm{Beta}(m\pi, (1 - m)\pi)$. The standard $\mathrm{GEM}(\gamma)$ is the special case when
$m\pi = 1$ and $\gamma = (1 - m)\pi$. Note that its mean and variance are tied through its
single parameter.
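
To see how the two parameters separate mean from variance, the moments of the beta draws can be written out using the standard formulas for a Beta(a, b) variable with a = mπ and b = (1 − m)π. This worked calculation is an added illustration, not part of the original text.

```latex
\mathrm{E}[V_i] = \frac{m\pi}{m\pi + (1-m)\pi} = m,
\qquad
\mathrm{Var}[V_i] = \frac{m\pi \cdot (1-m)\pi}{\pi^{2}(\pi + 1)} = \frac{m(1-m)}{\pi + 1}.
% Standard GEM(\gamma): m\pi = 1 and (1-m)\pi = \gamma give
% \pi = 1 + \gamma and m = 1/(1+\gamma), so one parameter fixes both moments.
```

Thus m sets the expected fraction broken off at each step, while π controls how variable that fraction is around m.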



3. The nested Chinese restaurant process 

The Chinese restaurant process and related distributions are widely used 
in Bayesian nonparametric statistics because they make it possible to define 
statistical models in which observations are assumed to be drawn from an 
unknown number of classes. However, this kind of model is limited in 
the structures that it allows to be expressed in data. Analyzing the richly 
structured data that are common in computer science requires extending 
this approach. In this section we discuss how similar ideas can be used 
to define a probability distribution on infinitely-deep, infinitely-branching 
trees. This distribution is subsequently used as a prior distribution in a 
hierarchical topic model that identifies documents with paths down the tree. 

A tree can be viewed as a nested sequence of partitions. We obtain a 
distribution on trees by generalizing the CRP to such sequences. Specifi- 
cally, we define a nested Chinese restaurant process (nCRP) by imagining 
the following scenario for generating a sample. Suppose there are an infi- 
nite number of infinite-table Chinese restaurants in a city. One restaurant 
is identified as the root restaurant, and on each of its infinite tables is a 
card with the name of another restaurant. On each of the tables in those
restaurants are cards that refer to other restaurants, and this structure
repeats infinitely many times.² Each restaurant is referred to exactly once;
thus, the restaurants in the city are organized into an infinitely-branched,
infinitely-deep tree. Note that each restaurant is associated with a level in
this tree. The root restaurant is at level 1, the restaurants referred to on its
tables' cards are at level 2, and so on.

Figure 3. A configuration of the nested Chinese restaurant
process illustrated to three levels. Each box represents
a restaurant with an infinite number of tables, each of which
refers to a unique table in the next level of the tree. In
this configuration, five tourists have visited restaurants along
four unique paths. Their paths trace a subtree in the infinite
tree. (Note that the configuration of customers within each
restaurant can be determined by observing the restaurants
chosen by customers at the next level of the tree.) In the
hLDA model of Section 4, each restaurant is associated with
a topic distribution $\beta$. Each document is assumed to choose
its words from the topic distributions along a randomly
chosen path.

A tourist arrives at the city for a culinary vacation. On the first evening,
he enters the root Chinese restaurant and selects a table using the CRP
distribution in Eq. (1). On the second evening, he goes to the restaurant identified
on the first night's table and chooses a second table using a CRP distribution 
based on the occupancy pattern of the tables in the second night's restaurant. 
He repeats this process forever. After M tourists have been on vacation in 
the city, the collection of paths describes a random subtree of the infinite 
tree; this subtree has a branching factor of at most M at all nodes. See 
Figure 3 for an example of the first three levels from such a random tree.
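
The tourist scenario can be sketched as follows. The nCRP itself has infinite depth; this illustration follows each tourist for only a fixed number of evenings, which is an assumption of the example, as are the name `sample_ncrp_paths` and the choice to identify a restaurant by the tuple of table choices that leads to it.

```python
import random

def sample_ncrp_paths(num_tourists, depth, gamma, rng):
    """Draw one truncated path per tourist; a path is a tuple of table indices."""
    # tables[node] holds the customer counts of the tables in the restaurant at `node`.
    # The root restaurant is the empty tuple ().
    tables = {}
    paths = []
    for _tourist in range(num_tourists):
        node = ()
        for _evening in range(depth):
            counts = tables.setdefault(node, [])
            # Apply the CRP rule using the occupancy of the current restaurant only.
            weights = counts + [gamma]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(counts):
                counts.append(1)
            else:
                counts[choice] += 1
            # The chosen table's card names the restaurant visited the next evening.
            node = node + (choice,)
        paths.append(node)
    return paths, tables

paths, tables = sample_ncrp_paths(5, 3, 1.0, random.Random(0))
for path in paths:
    print(path)
```

Tourists whose paths share a prefix visited the same restaurants on those evenings, so the set of distinct prefixes traces out the random subtree described above.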



² A finite-depth precursor of this model was presented in Blei et al. (2003).

There are many ways to place prior distributions on trees, and our
specific choice is based on several considerations. First and foremost, a prior
distribution combines with a likelihood to yield a posterior distribution, and 
we must be able to compute this posterior distribution. In our case, the 
likelihood will arise from the hierarchical topic model to be described in 
Section 4. As we will show in Section 5, the specific prior that we propose
in this section combines with the likelihood to yield a posterior distribution 
that is amenable to probabilistic inference. Second, we have retained impor- 
tant aspects of the CRP, in particular the "preferential attachment" dynamics 
that are built into Eq. (1). Probability structures of this form have been used
as models in a variety of applications (Barabasi and Reka, 1999; Krapivsky
and Redner, 2001; Albert and Barabasi, 2002; Drinea et al., 2006), and the
clustering that they induce makes them a reasonable starting place for a
hierarchical topic model.

In fact, these two points are intimately related. The CRP yields an ex- 
changeable distribution across partitions, i.e., the distribution is invariant to 
the order of the arrival of customers (Pitman, 2002). This exchangeability
property makes CRP-based models amenable to posterior inference using
Monte Carlo methods (Escobar and West, 1995; MacEachern and Müller,
1998; Neal, 2000).



4. Hierarchical latent Dirichlet allocation 

The nested CRP provides a way to define a prior on tree topologies that 
does not limit the branching factor or depth of the trees. We can use this 
distribution as a component of a probabilistic topic model. 

The goal of topic modeling is to identify subsets of words that tend to 
co-occur within documents. Some of the early work on topic modeling 
derived from latent semantic analysis, an application of the singular value 
decomposition in which "topics" are viewed post hoc as the basis of a
low-dimensional subspace (Deerwester et al., 1990). Subsequent work treated
topics as probability distributions over words and used likelihood-based
methods to estimate these distributions from a corpus (Hofmann, 1999b). In
both of these approaches, the interpretation of "topic" differs in key ways 
from the clustering metaphor because the same word can be given high 
probability (or weight) under multiple topics. This gives topic models the 
capability to capture notions of polysemy (e.g., "bank" can occur with high 
probability in both a finance topic and a waterways topic). Probabilistic 
topic models were given a fully Bayesian treatment in the latent Dirichlet 
allocation (LDA) model (Blei et al., 2003).

Topic models such as LDA treat topics as a "flat" set of probability dis- 
tributions, with no direct relationship between one topic and another. While 
these models can be used to recover a set of topics from a corpus, they fail
to indicate the level of abstraction of a topic, or how the various topics are 
related. The model that we present in this section builds on the nCRP to 
define a hierarchical topic model. This model arranges the topics into a 
tree, with the desideratum that more general topics should appear near the 



root and more specialized topics should appear near the leaves (Hofmann,
1999a). Having defined such a model, we use probabilistic inference to
simultaneously identify the topics and the relationships between them. 

Our approach to defining a hierarchical topic model is based on iden- 
tifying documents with the paths generated by the nCRP. We augment the 
nCRP in two ways to obtain a generative model for documents. First, we as- 
sociate a topic, i.e., a probability distribution across words, with each node 
in the tree. A path in the tree thus picks out an infinite collection of topics. 
Second, given a choice of path, we use the GEM distribution to define a 
probability distribution on the topics along this path. Given a draw from a 
GEM distribution, a document is generated by repeatedly selecting topics 
according to the probabilities defined by that draw, and then drawing each 
word from the probability distribution defined by its selected topic. 

More formally, consider the infinite tree defined by the nCRP and let 
Cd denote the path through that tree for the rfth customer (i.e., document). 
In the hierarchical LDA (hLDA) model, the documents in a corpus are as- 
sumed drawn from the following generative process: 

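(1) For each table $k \in T$ in the infinite tree,

(a) Draw a topic $\beta_k \sim \mathrm{Dirichlet}(\eta)$.

(2) For each document, $d \in \{1, 2, \ldots, D\}$,

(a) Draw $\mathbf{c}_d \sim \mathrm{nCRP}(\gamma)$.

(b) Draw a distribution over levels in the tree, $\theta_d \mid \{m, \pi\} \sim \mathrm{GEM}(m, \pi)$.

(c) For each word,

(i) Choose level $Z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d)$.

(ii) Choose word $W_{d,n} \mid \{z_{d,n}, \mathbf{c}_d, \beta\} \sim \mathrm{Mult}(\beta_{\mathbf{c}_d}[z_{d,n}])$, which
is parameterized by the topic in position $z_{d,n}$ on the path $\mathbf{c}_d$.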
This generative process defines a probability distribution across possible 
corpora. 
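
The generative process above can be sketched in Python under two simplifying assumptions that are not part of the model definition: the tree is truncated at a fixed depth, and the GEM draw over levels is replaced by a Dirichlet draw over that fixed number of levels (the same simplification used for the simulations in Section 6.1). All function and parameter names (`generate_corpus`, `crp_choice`, `level_alpha`, and so on) are hypothetical, and topics are created only when a node is first visited.

```python
import random
from collections import defaultdict


def crp_choice(counts, gamma, rng):
    """Pick an existing table with probability proportional to its count,
    or a new table (index len(counts)) with probability proportional to gamma."""
    u = rng.random() * (sum(counts) + gamma)
    for table, n in enumerate(counts):
        u -= n
        if u < 0:
            return table
    return len(counts)


def draw_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1


def generate_corpus(n_docs, n_words, vocab_size, depth, gamma, eta,
                    level_alpha, seed=0):
    rng = random.Random(seed)
    children = defaultdict(list)  # node -> customer counts of its child tables
    topics = {}                   # node -> word distribution, drawn on first visit
    corpus, leaves = [], []
    for _ in range(n_docs):
        node, path = (), [()]     # the root is the empty tuple
        for _ in range(depth - 1):
            k = crp_choice(children[node], gamma, rng)
            if k == len(children[node]):
                children[node].append(0)      # open a new table
            children[node][k] += 1
            node = node + (k,)
            path.append(node)
        for nd in path:
            if nd not in topics:
                topics[nd] = draw_dirichlet([eta] * vocab_size, rng)
        theta = draw_dirichlet(level_alpha, rng)  # truncated stand-in for GEM
        words = []
        for _ in range(n_words):
            z = sample_index(theta, rng)          # level for this word
            words.append(sample_index(topics[path[z]], rng))
        corpus.append(words)
        leaves.append(path[-1])
    return corpus, leaves, topics
```

Identifying a node with the tuple of table choices leading to it keeps the tree implicit: only visited nodes ever appear in `children` or `topics`.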

The goal of finding a topic hierarchy at different levels of abstraction
is distinct from the problem of hierarchical clustering (Zamir and Etzioni,
1998; Larsen and Aone, 1999; Vaithyanathan and Dom, 2000; Duda et al.,
2000; Hastie et al., 2001; Heller and Ghahramani, 2005). Hierarchical
clustering treats each data point as a leaf in a tree, and merges similar
data points up the tree until all are merged into a root node. Thus, internal
nodes represent summaries of the data below which, in this setting, would
yield distributions across words that share high probability words with their
children.

In the hierarchical topic model, the internal nodes are not summaries of 
their children. Rather, the internal nodes reflect the shared terminology of 
the documents assigned to the paths that contain them. This can be seen in 
Figure 1, where the high probability words of a node are distinct from the
high probability words of its children. 

It is important to emphasize that our approach is an unsupervised learning 
approach in which the probabilistic components that we have defined are 
latent variables. That is, we do not assume that topics are predefined, nor 
do we assume that the nested partitioning of documents or the allocation 
of topics to levels are predefined. We infer these entities from a Bayesian 
computation in which a posterior distribution is obtained from conditioning 
on a corpus and computing probabilities for all latent variables. 

As we will see experimentally, there is statistical pressure in the posterior 
to place more general topics near the root of the tree and to place more spe- 
cialized topics further down in the tree. To see this, note that each path in the 
tree includes the root node. Given that the GEM distribution tends to assign 
relatively large probabilities to small integers, there will be a relatively large 
probability for documents to select the root node when generating words. 
Therefore, to explain an observed corpus, the topic at the root node will 
place high probability on words that are useful across all the documents. 

Moving down in the tree, recall that each document is assigned to a single 
path. Thus, the first level below the root induces a coarse partition on the 
documents, and the topics at that level will place high probability on words 
that are useful within the corresponding subsets. As we move still further 
down, the nested partitions of documents become finer. Consequently, the 
corresponding topics will be more specialized to the particular documents 
in those paths. 

We have presented the model as a two-phase process: an infinite set of
topics is generated and assigned to all of the nodes of an infinite tree, and
then documents are obtained by selecting nodes in the tree and drawing 
words from the corresponding topics. It is also possible, however, to con- 
ceptualize a "lazy" procedure in which a topic is generated only when a 
node is first selected. In particular, consider an empty tree (i.e., containing 
no topics) and consider generating the first document. We select a path and 
then repeatedly select nodes along that path in order to generate words. A 
topic is generated at a node when that node is first selected and subsequent 
selections of the node reuse the same topic. 

After $n$ words have been generated, at most $n$ nodes will have been
visited and at most $n$ topics will have been generated. The $(n + 1)$th word
in the document can come from one of the previously generated topics or it
can come from a new topic. Similarly, suppose that $d$ documents have
previously been generated. The $(d + 1)$th document can follow one of the
paths laid down by an earlier document and select only "old" topics, or it
can branch off at any point in the tree and generate "new" topics along the
new branch.

This discussion highlights the nonparametric nature of our model. Rather 
than describing a corpus by using a probabilistic model involving a fixed 
set of parameters, our model assumes that the number of parameters can 
grow as the corpus grows, both within documents and across documents. 
New documents can spark new subtopics or new specializations of existing 
subtopics. Given a corpus, this flexibility allows us to use approximate pos- 
terior inference to discover the particular tree of topics that best describes 
its documents. 

It is important to note that even with this flexibility, the model still makes 
assumptions about the tree. Its size, shape, and character will be affected 
by the settings of the hyperparameters. The most influential hyperparameters
in this regard are the Dirichlet parameter for the topics $\eta$ and the
stick-breaking parameters for the topic proportions $\{m, \pi\}$. The Dirichlet
parameter controls the sparsity of the topics; smaller values of $\eta$ will lead to topics
with most of their probability mass on a small set of words. With a prior 
bias to sparser topics, the posterior will prefer more topics to describe a col- 
lection and thus place higher probability on larger trees. The stick-breaking 
parameters control how many words in the documents are likely to come 
from topics of varying abstractions. If we set $m$ to be large (e.g., $m = 0.5$)
then the posterior will more likely assign more words from each document
to higher levels of abstraction. Setting $\pi$ to be large (e.g., $\pi = 100$) means
that word allocations will not likely deviate from such a setting.

How we set these hyperparameters depends on the goal of the analysis. 
When we analyze a document collection with hLDA for discovering and 
visualizing a hierarchy embedded within it, we might examine various set- 
tings of the hyperparameters to find a tree that meets our exploratory needs. 
We analyze documents with this purpose in mind in Section 6.2. In a
different setting, when we are looking for a good predictive model of the data,
e.g., to compare hLDA to other statistical models of text, then it makes
sense to "fit" the hyperparameters by placing priors on them and computing
their posterior. We describe posterior inference for the hyperparameters in
Section 5.4 and analyze documents using this approach in Section 6.3.



Finally, we note that hLDA is the simplest model that exploits the nested 
CRP, i.e., a flexible hierarchy of distributions, in the topic modeling frame- 
work. In a more complicated model, one could consider a variant of hLDA 
where each document exhibits multiple paths through the tree. This can be 
modeled using a two-level distribution for word generation: first choose a
path through the tree, and then choose a level for the word. 

Recent extensions to topic models can also be adapted to make use of a
flexible topic hierarchy. As examples, in the dynamic topic model the
documents are time stamped and the underlying topics change over time (Blei
and Lafferty, 2006); in the author-topic model the authorship of the
documents affects which topics they exhibit (Rosen-Zvi et al., 2004). This said,
some extensions are more easily adaptable than others. In the correlated
topic model, the topic proportions exhibit a covariance structure (Blei and
Lafferty, 2007). This is achieved by replacing a Dirichlet distribution with
a logistic normal, and the application of Bayesian nonparametric extensions
is less direct.

4.1. Related work. In previous work, researchers have developed a number
of methods that employ hierarchies in analyzing text data. In one line
of work, the algorithms are given a hierarchy of document categories, and
their goal is to correctly place documents within it (Koller and Sahami,
1997; Chakrabarti et al., 1998; McCallum et al., 1999; Dumais and Chen,
2000). Other work has focused on deriving hierarchies of individual terms
using side information, such as a grammar or a thesaurus, that is
sometimes available for text domains (Sanderson and Croft, 1999; Stoica and
Hearst, 2004; Cimiano et al., 2005).



Our method provides still another way to employ a notion of hierarchy 
in text analysis. First, rather than learn a hierarchy of terms we learn a hi- 
erarchy of topics, where a topic is a distribution over terms that describes 
a significant pattern of word co-occurrence in the data. Moreover, while 
we focus on text, a "topic" is simply a data-generating distribution; we do 
not rely on any text-specific side information such as a thesaurus or gram- 
mar. Thus, by using other data types and distributions, our methodology is 
readily applied to biological data sets, purchasing data, collections of
images, or social network data. (Note that applications in such domains have
already been demonstrated for flat topic models (Pritchard et al., 2000;
Marlin, 2003; Fei-Fei and Perona, 2005; Blei and Jordan, 2003; Airoldi et al.,
2008).) Finally, as a Bayesian nonparametric model, our approach can
accommodate future data that might lie in new and previously undiscovered
parts of the tree. Previous work commits to a single fixed tree for all future
data.



5. Probabilistic inference 

With the hLDA model in hand, our goal is to perform posterior inference, 
i.e., to "invert" the generative process of documents described above for
estimating the hidden topical structure of a document collection. We have
constructed a joint distribution of hidden variables and observations (the
latent topic structure and observed documents) by combining prior
expectations about the kinds of tree topologies we will encounter with a
generative process for producing documents given a particular topology. We are
now interested in the distribution of the hidden structure conditioned on
having seen the data, i.e., the distribution of the underlying topic structure
that might have generated an observed collection of documents. Finding
this posterior distribution for different kinds of data and models is a central
problem in Bayesian statistics. See Bernardo and Smith (1994) and Gelman
et al. (1995) for general introductions to Bayesian statistics.

In our nonparametric setting, we must find a posterior distribution on 
countably infinite collections of objects — hierarchies, path assignments, and 
level allocations of words — given a collection of documents. Moreover, we 
need to be able to do this using the finite resources of the computer. Not 
surprisingly, the posterior distribution for hLDA is not available in closed 
form. We must appeal to an approximation. 

We develop a Markov chain Monte Carlo (MCMC) algorithm to approx- 
imate the posterior for hLDA. In MCMC, one samples from a target dis- 
tribution on a set of variables by constructing a Markov chain that has the 
target distribution as its stationary distribution (Robert and Casella, 2004).
One then samples from the chain for sufficiently long that it approaches the 
target, collects the sampled states thereafter, and uses those collected states 
to estimate the target. This approach is particularly straightforward to apply 
to latent variable models, where we take the state space of the Markov chain 
to be the set of values that the latent variables can take on, and the target 
distribution is the conditional distribution of these latent variables given the 
observed data. 

The particular MCMC algorithm that we present in this paper is a Gibbs
sampling algorithm (Geman and Geman, 1984; Gelfand and Smith, 1990).
In a Gibbs sampler each latent variable is iteratively sampled conditioned
on the observations and all the other latent variables. We employ collapsed
Gibbs sampling (Liu, 1994), in which we marginalize out some of the latent
variables to speed up the convergence of the chain. Collapsed Gibbs
sampling for topic models (Griffiths and Steyvers, 2004) has been widely used
in a number of topic modeling applications (McCallum et al., 2004;
Rosen-Zvi et al., 2004; Mimno and McCallum, 2007; Dietz et al., 2007; Newman
et al., 2006).



In hLDA, we sample the per-document paths $\mathbf{c}_d$ and the per-word level
allocations to topics in those paths $z_{d,n}$. We marginalize out the topic
parameters $\beta_i$ and the per-document topic proportions $\theta_d$. The state of the
Markov chain is illustrated, for a single document, in Figure 4. (The
particular assignments illustrated in the figure are taken at the approximate mode
of the hLDA model posterior conditioned on abstracts from the JACM.)

Figure 4. A single state of the Markov chain in the
Gibbs sampler for the abstract of "A new approach to the
maximum-flow problem" [Goldberg and Tarjan, 1986]. The
document is associated with a path through the hierarchy $\mathbf{c}_d$,
and each node in the hierarchy is associated with a
distribution over terms. (The five most probable terms are
illustrated.) Finally, each word in the abstract $w_{d,n}$ is associated
with a level in the path through the hierarchy $z_{d,n}$, with 0
being the highest level and 2 being the lowest. The Gibbs
sampler iteratively draws $\mathbf{c}_d$ and $z_{d,n}$ for all words in all
documents (see Section 5).

Thus, we approximate the posterior $p(\mathbf{c}_{1:D}, \mathbf{z}_{1:D} \mid \gamma, \eta, m, \pi, \mathbf{w}_{1:D})$. The
hyperparameter $\gamma$ reflects the tendency of the customers in each restaurant
to share tables, $\eta$ reflects the expected variance of the underlying topics (e.g.,
$\eta \ll 1$ will tend to choose topics with fewer high-probability words), and $m$
and $\pi$ reflect our expectation about the allocation of words to levels within
a document. The hyperparameters can be fixed according to the constraints
of the analysis and prior expectation about the data, or inferred as described
in Section 5.4.

Intuitively, the CRP parameter $\gamma$ and topic prior $\eta$ provide control over
the size of the inferred tree. For example, a model with large $\gamma$ and small
$\eta$ will tend to find a tree with more topics. The small $\eta$ encourages fewer
words to have high probability in each topic; thus, the posterior requires
more topics to explain the data. The large $\gamma$ increases the likelihood that
documents will choose new paths when traversing the nested CRP.

The GEM parameter $m$ reflects the proportion of general words relative
to specific words, and the GEM parameter $\pi$ reflects how strictly we expect
the documents to adhere to these proportions. A larger value of $\pi$ enforces
the notions of generality and specificity that lead to more interpretable trees.

The remainder of this section is organized as follows. First, we outline 
the two main steps in the algorithm: the sampling of level allocations and 
the sampling of path assignments. We then combine these steps into an 
overall algorithm. Next, we present prior distributions for the hyperparam- 
eters of the model and describe posterior inference for the hyperparameters. 
Finally, we outline how to assess the convergence of the sampler and ap- 
proximate the mode of the posterior distribution. 

5.1. Sampling level allocations. Given the current path assignments, we
need to sample the level allocation variable $z_{d,n}$ for word $n$ in document $d$
from its distribution given the current values of all other variables:

(2) $p(z_{d,n} \mid \mathbf{z}_{-(d,n)}, \mathbf{c}, \mathbf{w}, m, \pi, \eta) \propto p(z_{d,n} \mid \mathbf{z}_{d,-n}, m, \pi)\, p(w_{d,n} \mid \mathbf{z}, \mathbf{c}, \mathbf{w}_{-(d,n)}, \eta),$

where $\mathbf{z}_{-(d,n)}$ and $\mathbf{w}_{-(d,n)}$ are the vectors of level allocations and observed
words leaving out $z_{d,n}$ and $w_{d,n}$ respectively. We will use similar notation
whenever items are left out from an index set; for example, $\mathbf{z}_{d,-n}$ denotes
the level allocations in document $d$, leaving out $z_{d,n}$.

The first term in Eq. (2) is a distribution over levels. This distribution has
an infinite number of components, so we sample in stages. First, we sample
from the distribution over the space of levels that are currently represented
in the rest of the document, i.e., $\max(\mathbf{z}_{d,-n})$, and a level deeper than that
level. The first components of this distribution are, for $k \le \max(\mathbf{z}_{d,-n})$,

$$
\begin{aligned}
p(z_{d,n} = k \mid \mathbf{z}_{d,-n}, m, \pi)
&= E\Big[ V_k \prod_{j=1}^{k-1} (1 - V_j) \,\Big|\, \mathbf{z}_{d,-n}, m, \pi \Big] \\
&= E[V_k \mid \mathbf{z}_{d,-n}, m, \pi] \prod_{j=1}^{k-1} E[1 - V_j \mid \mathbf{z}_{d,-n}, m, \pi] \\
&= \frac{(1 - m)\pi + \#[\mathbf{z}_{d,-n} = k]}{\pi + \#[\mathbf{z}_{d,-n} \ge k]} \prod_{j=1}^{k-1} \frac{m\pi + \#[\mathbf{z}_{d,-n} > j]}{\pi + \#[\mathbf{z}_{d,-n} \ge j]},
\end{aligned}
$$

where $\#[\cdot]$ counts the elements of an array satisfying a given condition.
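
As a minimal sketch, the first-stage probabilities over the levels already represented in the rest of the document can be computed as below, following the expression as printed above. The function name `level_probabilities` and the list-based input are illustrative assumptions. The function evaluates only the first term of Eq. (2); the full conditional additionally multiplies each component by the word probability.

```python
def level_probabilities(levels_rest, m, pi):
    """Return (probs, tail) for the level-allocation term of Eq. (2).

    levels_rest: 1-based levels currently assigned to the other words of the
    document (assumed non-empty). probs[k-1] is the probability of level k for
    k up to max(levels_rest); tail is the remaining mass for deeper levels.
    """
    max_level = max(levels_rest)
    probs = []
    pass_prod = 1.0  # running product of E[1 - V_j] for j < k
    for k in range(1, max_level + 1):
        n_eq = sum(1 for z in levels_rest if z == k)
        n_gt = sum(1 for z in levels_rest if z > k)
        n_ge = n_eq + n_gt
        e_vk = ((1.0 - m) * pi + n_eq) / (pi + n_ge)
        probs.append(e_vk * pass_prod)
        pass_prod *= (m * pi + n_gt) / (pi + n_ge)
    tail = 1.0 - sum(probs)
    return probs, tail
```

Because the two posterior expectations at each level sum to one, the tail mass equals the final running product, which gives a quick sanity check on the counts.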

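The second term in Eq. (2) is the probability of a given word based on a
possible assignment. From the assumption that the topic parameters are
generated from a Dirichlet distribution with hyperparameters $\eta$ we obtain

(3) $p(w_{d,n} \mid \mathbf{z}, \mathbf{c}, \mathbf{w}_{-(d,n)}, \eta) \propto \#[\mathbf{z}_{-(d,n)} = z_{d,n}, \mathbf{c}_{z_{d,n}} = c_{d,z_{d,n}}, \mathbf{w}_{-(d,n)} = w_{d,n}] + \eta,$

which is the smoothed frequency of seeing word $w_{d,n}$ allocated to the topic
at level $z_{d,n}$ of the path $\mathbf{c}_d$.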

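The last component of the distribution over topic assignments is

$$
p(z_{d,n} > \max(\mathbf{z}_{d,-n}) \mid \mathbf{z}_{d,-n}, \mathbf{w}, m, \pi, \eta) = 1 - \sum_{j=1}^{\max(\mathbf{z}_{d,-n})} p(z_{d,n} = j \mid \mathbf{z}_{d,-n}, \mathbf{w}, m, \pi, \eta).
$$

If the last component is sampled then we sample from a Bernoulli
distribution for increasing values of $\ell$, starting with $\ell = \max(\mathbf{z}_{d,-n}) + 1$, until we
determine $z_{d,n}$,

$$
\begin{aligned}
p(z_{d,n} = \ell \mid \mathbf{z}_{d,-n}, z_{d,n} > \ell - 1, \mathbf{w}, m, \pi, \eta) &= (1 - m)\, p(w_{d,n} \mid \mathbf{z}, \mathbf{c}, \mathbf{w}_{-(d,n)}, \eta) \\
p(z_{d,n} > \ell \mid \mathbf{z}_{d,-n}, z_{d,n} > \ell - 1) &= 1 - p(z_{d,n} = \ell \mid \mathbf{z}_{d,-n}, z_{d,n} > \ell - 1, \mathbf{w}, m, \pi, \eta).
\end{aligned}
$$

Note that this changes the maximum level when resampling subsequent
level assignments.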

5.2. Sampling paths. Given the level allocation variables, we need to sam- 
ple the path associated with each document conditioned on all other paths 
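and the observed words. We appeal to the fact that $\max(\mathbf{z}_d)$ is finite, and
are only concerned with paths of that length:

(4) $p(\mathbf{c}_d \mid \mathbf{w}, \mathbf{c}_{-d}, \mathbf{z}, \eta, \gamma) \propto p(\mathbf{c}_d \mid \mathbf{c}_{-d}, \gamma)\, p(\mathbf{w}_d \mid \mathbf{c}, \mathbf{w}_{-d}, \mathbf{z}, \eta).$

This expression is an instance of Bayes's theorem with $p(\mathbf{w}_d \mid \mathbf{c}, \mathbf{w}_{-d}, \mathbf{z}, \eta)$
as the probability of the data given a particular choice of path, and $p(\mathbf{c}_d \mid \mathbf{c}_{-d}, \gamma)$
as the prior on paths implied by the nested CRP. The probability of the data
is obtained by integrating over the multinomial parameters, which gives a
ratio of normalizing constants for the Dirichlet distribution,

$$
p(\mathbf{w}_d \mid \mathbf{c}, \mathbf{w}_{-d}, \mathbf{z}, \eta) = \prod_{\ell=1}^{\max(\mathbf{z}_d)}
\frac{\Gamma\left(\sum_w \#[\mathbf{z}_{-d} = \ell, \mathbf{c}_{-d,\ell} = c_{d,\ell}, \mathbf{w}_{-d} = w] + V\eta\right)}
     {\prod_w \Gamma\left(\#[\mathbf{z}_{-d} = \ell, \mathbf{c}_{-d,\ell} = c_{d,\ell}, \mathbf{w}_{-d} = w] + \eta\right)}
\cdot
\frac{\prod_w \Gamma\left(\#[\mathbf{z} = \ell, \mathbf{c}_{\ell} = c_{d,\ell}, \mathbf{w} = w] + \eta\right)}
     {\Gamma\left(\sum_w \#[\mathbf{z} = \ell, \mathbf{c}_{\ell} = c_{d,\ell}, \mathbf{w} = w] + V\eta\right)},
$$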

where we use the same notation for counting over arrays of variables as 
above. Note that the path must be drawn as a block, because its value at each 
level depends on its value at the previous level. The set of possible paths 
corresponds to the union of the set of existing paths through the tree, each 
represented by a leaf, with the set of possible novel paths, each represented 
by an internal node. 
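
The ratio of Dirichlet normalizing constants is most safely evaluated in log space. The sketch below is one way to do so for a single candidate path; the function name and the dictionary-based count layout are assumptions made for illustration. Words that do not occur in document $d$ contribute identical Gamma terms to numerator and denominator, so only the document's own words need to be visited.

```python
from math import lgamma


def log_path_likelihood(doc_counts, rest_counts, eta, vocab_size):
    """Log of p(w_d | c, w_{-d}, z, eta) for one candidate path.

    doc_counts[l]  maps word -> count of that word in document d at level l.
    rest_counts[l] maps word -> count of that word at the candidate path's
                   level-l node over all other documents (empty for a new node).
    """
    total = 0.0
    for doc_l, rest_l in zip(doc_counts, rest_counts):
        n_rest = sum(rest_l.values())
        n_doc = sum(doc_l.values())
        # Gamma terms that involve the summed counts at this node.
        total += lgamma(n_rest + vocab_size * eta)
        total -= lgamma(n_rest + n_doc + vocab_size * eta)
        # Per-word Gamma terms; words absent from the document cancel.
        for word, count in doc_l.items():
            r = rest_l.get(word, 0)
            total += lgamma(r + count + eta) - lgamma(r + eta)
    return total
```

For a document that places a single word on a brand-new node, the expression reduces to a probability of $1/V$, which is a convenient check.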

5.3. Summary of Gibbs sampling algorithm. With these conditional dis- 
tributions in hand, we specify the full Gibbs sampling algorithm. Given the 
current state of the sampler, $\{\mathbf{c}_{1:D}^{(t)}, \mathbf{z}_{1:D}^{(t)}\}$, we iteratively sample each
variable conditioned on the rest.

(1) For each document $d \in \{1, \ldots, D\}$

(a) Randomly draw $\mathbf{c}_d^{(t+1)}$ from Eq. (4).

(b) Randomly draw $z_{d,n}^{(t+1)}$ from Eq. (2) for each word, $n \in \{1, \ldots, N_d\}$.

The stationary distribution of the corresponding Markov chain is the con- 
ditional distribution of the latent variables in the hLDA model given the 
corpus. After running the chain for sufficiently many iterations that it can 
approach its stationary distribution (the "burn-in") we can collect samples 
at intervals selected to minimize autocorrelation, and approximate the true 
posterior with the corresponding empirical distribution. 
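
The burn-in and thinning schedule described above can be written as a small driver around any transition function. In this sketch, `step` is a hypothetical stand-in for one full sweep of the Gibbs sampler, and the names `burn_in` and `lag` are chosen here for illustration only.

```python
def collect_samples(step, state, burn_in, lag, n_samples):
    """Run a Markov chain and keep every `lag`-th state after burn-in.

    step: function mapping the current state to the next state
          (for hLDA, one sweep over all documents; assumed to be supplied).
    """
    for _ in range(burn_in):
        state = step(state)          # discard the burn-in iterations
    kept = []
    while len(kept) < n_samples:
        for _ in range(lag):
            state = step(state)      # advance `lag` iterations between samples
        kept.append(state)
    return kept
```

If `step` mutates the state in place rather than returning a new object, the caller should copy each state before it is stored.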

Although this algorithm is guaranteed to converge in the limit, it is dif- 
ficult to say something more definitive about the speed of the algorithm 
independent of the data being analyzed. In hLDA, we sample a leaf from
the tree for each document and a level assignment for each word $z_{d,n}$. As
described above, the number of items from which each is sampled depends
on the current state of the hierarchy and other level assignments in the
document. Two data sets of equal size may induce different trees and yield
different running times for each iteration of the sampler. For the corpora
analyzed below in Section 6, the Gibbs sampler averaged 0.001 seconds
per document for the JACM data and Psychological Review data, and 0.006
seconds per document for the Proceedings of the National Academy of
Sciences data. (Timings were measured with the Gibbs sampler running on a
2.2GHz Opteron 275 processor.)

Figure 5. (Left) The complete log likelihood of Eq. (5)
for the first 2000 iterations of the Gibbs sampler run on the
JACM corpus of Section 6.2. (Right) The autocorrelation
function (ACF) of the complete log likelihood (with
confidence interval) for the remaining 8000 iterations. The
autocorrelation decreases rapidly as a function of the lag between
samples. (Plot title: "Log complete probability for the JACM corpus.")

5.4. Sampling the hyperparameters. The values of hyperparameters are 
generally unknown a priori. We include them in the inference process by 
endowing them with prior distributions, 

$m \sim \mathrm{Beta}(\alpha_1, \alpha_2)$

$\pi \sim \mathrm{Exponential}(\alpha_3)$

$\gamma \sim \mathrm{Gamma}(\alpha_4, \alpha_5)$

$\eta \sim \mathrm{Exponential}(\alpha_6).$

These priors also contain parameters ("hyper-hyperparameters"), but the
resulting inferences are less influenced by these hyper-hyperparameters than
they are by fixing the original hyperparameters to specific values (Bernardo
and Smith, 1994).



To incorporate this extension into the Gibbs sampler, we interleave
Metropolis-Hastings (MH) steps between iterations of the Gibbs sampler to obtain new
values of $m$, $\pi$, $\gamma$, and $\eta$. This preserves the integrity of the Markov chain,
although it may mix more slowly than the collapsed Gibbs sampler without the
MH updates (Robert and Casella, 2004).
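
The text does not specify the proposal distribution used in these MH steps. As an illustration only, the sketch below assumes a symmetric Gaussian random-walk proposal, in which case the acceptance ratio reduces to a ratio of target densities. `metropolis_step` and `log_exponential_prior` are hypothetical names; in a full sampler the log target for a hyperparameter would be its log prior plus the log probability of the current sampler state given that hyperparameter.

```python
import math
import random


def log_exponential_prior(x, rate):
    """Log density of an Exponential(rate) prior; -inf outside the support."""
    return math.log(rate) - rate * x if x > 0 else float("-inf")


def metropolis_step(value, log_target, step_size, rng):
    """One Metropolis update with a symmetric Gaussian random-walk proposal.

    value must lie in the support of the target. Proposals outside the
    support have log target -inf and are therefore always rejected.
    """
    proposal = value + rng.gauss(0.0, step_size)
    log_ratio = log_target(proposal) - log_target(value)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return value


# Usage: update a positive hyperparameter whose target is just its prior.
rng = random.Random(0)
eta = 1.0
for _ in range(1000):
    eta = metropolis_step(eta, lambda v: log_exponential_prior(v, 1.0), 1.0, rng)
```

Handling the positivity constraint through the target, rather than by clipping proposals, keeps the proposal symmetric, so no Hastings correction is needed.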

5.5. Assessing convergence and approximating the mode. Practical
applications must address the issue of approximating the mode of the
distribution on trees and assessing convergence of the Markov chain. We can obtain
information about both by examining the log probability of each sampled
state. For a particular sample, i.e., a configuration of the latent variables,
we compute the log probability of that configuration and observations,
conditioned on the hyperparameters:

(5) $\mathcal{L}^{(t)} = \log p(\mathbf{c}_{1:D}^{(t)}, \mathbf{z}_{1:D}^{(t)}, \mathbf{w}_{1:D} \mid \gamma, \eta, m, \pi).$

With this statistic, we can approximate the mode of the posterior by
choosing the state with the highest log probability. Moreover, we can assess
convergence of the chain by examining the autocorrelation of $\mathcal{L}^{(t)}$.
Figure 5 (right) illustrates the autocorrelation as a function of the number of
iterations between samples (the "lag") when modeling the JACM corpus
described in Section 6.2. The chain was run for 10,000 iterations; 2000
iterations were discarded as burn-in.
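
Both diagnostics operate on the sequence of log probabilities recorded after burn-in. The sketch below uses the standard sample autocorrelation estimator and a simple argmax; the function names are illustrative and the series is assumed to be non-constant.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a non-constant series at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    acf = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(cov / var)
    return acf


def best_state(log_probs):
    """Index of the sampled state with the highest log probability."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)
```

An autocorrelation that falls quickly toward zero as the lag grows suggests that samples taken a modest number of iterations apart are close to independent.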

Figure 5 (left) illustrates Eq. (5) for the burn-in iterations. Gibbs samplers
stochastically climb the posterior distribution surface to find an area of high
posterior probability, and then explore its curvature through sampling. In
practice, one usually restarts this procedure a handful of times and chooses
the local mode which has highest posterior likelihood (Robert and Casella,
2004).

Despite the lack of theoretical guarantees, Gibbs sampling is appropriate 
for the kind of data analysis for which hLDA and many other latent vari- 
able models are tailored. Rather than try to understand the full surface of 
the posterior, the goal of latent variable modeling is to find a useful rep- 
resentation of complicated high-dimensional data, and a local mode of the 
posterior found by Gibbs sampling often provides such a representation. In 
the next section, we will assess hLDA qualitatively, through visualization 
of summaries of the data, and quantitatively, by using the latent variable 
representation to provide a predictive model of text. 



6. Examples and empirical results 

We present experiments analyzing both simulated and real text data to 
demonstrate the application of hLDA and its corresponding Gibbs sampler. 

6.1. Analysis of simulated data. In Figure 6, we depict the hierarchies
and allocations for ten simulated data sets drawn from an hLDA model. For
each data set, we draw 100 documents of 250 words each. The vocabulary
size is 100, and the hyperparameters are fixed at $\eta = 0.005$ and $\gamma = 1$. In
these simulations, we truncated the stick-breaking procedure at three levels,
and simply took a Dirichlet distribution over the proportion of words
allocated to those levels. The resulting hierarchies shown in Figure 6 illustrate
the range of structures on which the prior assigns probability.

Figure 6. Inferring the mode of the posterior hierarchy
from simulated data. See Section 6.1.

In the same figure, we illustrate the estimated mode of the posterior
distribution across the hierarchy and allocations for the ten data sets. We exactly
recover the correct hierarchies, with only two errors. In one case, the error
is a single wrongly allocated path. In the other case, the inferred mode has
higher posterior probability than the true tree structure (due to finite data).

In general we cannot expect to always find the exact tree. This is depen- 
dent on the size of the data set, and how identifiable the topics are. Our 
choice of small $\eta$ yields topics that are relatively sparse and (probably) very
different from each other. Trees will not be as easy to identify in data sets 
which exhibit polysemy and similarity between topics. 



6.2. Hierarchy discovery in scientific abstracts. Given a document col- 
lection, one is typically interested in examining the underlying tree of topics 
at the mode of the posterior. As described above, our inferential procedure 
yields a tree structure by assembling the unique subset of paths contained 
in $\{\mathbf{c}_1, \ldots, \mathbf{c}_D\}$ at the approximate mode of the posterior.

For a given tree, we can examine the topics that populate the tree. Given 
the assignment of words to levels and the assignment of documents to paths, 
the probability of a particular word at a particular node is roughly propor- 
tional to the number of times that word was generated by the topic at that 
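node. More specifically, the mean probability of a word $w$ in a topic at level
$\ell$ of path $p$ is given by

(6) $p(w \mid \mathbf{z}, \mathbf{c}, \mathbf{w}, \eta) = \frac{\#[\mathbf{z} = \ell, \mathbf{c} = p, \mathbf{w} = w] + \eta}{\#[\mathbf{z} = \ell, \mathbf{c} = p] + V\eta}.$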

Using these quantities, the hLDA model can be used for analyzing collec- 
tions of scientific abstracts, recovering the underlying hierarchical structure 
appropriate to a collection, and visualizing that hierarchy of topics for a 
better understanding of the structure of the corpora. We demonstrate the 
analysis of three different collections of journal abstracts under hLDA. 
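
Eq. (6) amounts to a smoothed count ratio per node. As a minimal sketch, assume each word token has been reduced to a (node, word) pair, where the node is the one selected by the token's level on its document's path; this data layout and the function names are hypothetical.

```python
from collections import Counter


def topic_word_probabilities(assignments, node, eta, vocab_size):
    """Mean word probabilities for the topic at `node`, following Eq. (6).

    assignments: iterable of (node_id, word_id) pairs, one per word token,
    with word_id in range(vocab_size).
    """
    counts = Counter(w for nd, w in assignments if nd == node)
    total = sum(counts.values())
    denom = total + vocab_size * eta
    return [(counts.get(w, 0) + eta) / denom for w in range(vocab_size)]


def top_words(probs, k):
    """Indices of the k most probable words, most probable first."""
    return sorted(range(len(probs)), key=lambda w: -probs[w])[:k]
```

Sorting the resulting probabilities and keeping the first few entries gives the kind of per-node word list used to label a topic hierarchy.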

In these analyses, as above, we truncate the stick-breaking procedure at 
three levels, facilitating visualization of the results. The topic Dirichlet
hyperparameters were fixed at $\eta = \{2.0, 1.0, 0.5\}$, which encourages many
terms in the high-level distributions, fewer terms in the mid-level
distributions, and still fewer terms in the low-level distributions. The nested CRP
parameter $\gamma$ was fixed at 1.0. The GEM parameters were fixed at $m = 0.5$
and $\pi = 100$. This strongly biases the level proportions to place more mass
at the higher levels of the hierarchy.

In Figure 1, we illustrate the approximate posterior mode of a hierarchy
estimated from a collection of 536 abstracts from the JACM. The tree
structure illustrates the ensemble of paths assigned to the documents. In each
node, we illustrate the top five words sorted by expected posterior
probability, computed from Eq. (6). Several leaves are annotated with document
titles. For each leaf, we chose the five documents assigned to its path that
have the highest numbers of words allocated to the bottom level.

The model has found the function words in the data set, assigning words 
like "the," "of," "or," and "and" to the root topic. In its second level, the 
posterior hierarchy appears to have captured some of the major subfields in 
computer science, distinguishing between databases, algorithms, program- 
ming languages and networking. In the third level, it further refines those 
fields. For example, it delineates between the verification area of network- 
ing and the queuing area. 

In Figure 7, we illustrate an analysis of a collection of 1,272 psychology
abstracts from Psychological Review from 1967 to 2003. Again, we have 
discovered an underlying hierarchical structure of the field. The top node 
contains the function words; the second level delineates between large sub- 
fields such as behavioral, social and cognitive psychology; the third level 
further refines those subfields. 

Finally, in Figure 8, we illustrate a portion of the analysis of a
collection of 12,913 abstracts from the Proceedings of the National Academy of
Sciences from 1991 to 2001. An underlying hierarchical structure of the 
content of the journal has been discovered, dividing articles into groups 
such as neuroscience, immunology, population genetics and enzymology. 

In all three of these examples, the same posterior inference algorithm 
with the same hyperparameters yields very different tree structures for dif- 
ferent corpora. Models of fixed tree structure force us to commit to one in 
advance of seeing the data. The nested Chinese restaurant process at the 
heart of hLDA provides a flexible solution to this difficult problem. 

6.3. Comparison to LDA. In this section we present experiments compar- 
ing hLDA to its non-hierarchical precursor, LDA. We use the infinite-depth 
hLDA model; the per-document distribution over levels is not truncated. We 
use predictive held-out likelihood to compare the two approaches quantita- 
tively, and we present examples of LDA topics in order to provide a quali- 
tative comparison of the methods. LDA has been shown to yield good pre- 
dictive performance relative to competing unigram language models, and it 
has also been argued that the topic-based analysis provided by LDA represents 
a qualitative improvement on competing language models (Blei et al., 
2003; Griffiths and Steyvers, 2006). Thus LDA provides a natural point of 
comparison. 

Figure 7. A portion of the hierarchy learned from the 
1,272 abstracts of Psychological Review from 1967-2003. 
The vocabulary was restricted to the 1,971 terms that occurred 
in more than five documents, yielding a corpus of 
136K words. The learned hierarchy, of which only a portion 
is illustrated, contains 52 topics. 

Figure 8. A portion of the hierarchy learned from the 
12,913 abstracts of the Proceedings of the National Academy 
of Sciences from 1991-2001. The vocabulary was restricted 
to the 7,200 terms that occurred in more than five documents, 
yielding a corpus of 2.3M words. The learned hierarchy, of 
which only a portion is illustrated, contains 56 topics. Note 
that the $\gamma$ parameter is fixed at a smaller value, to provide a 
reasonably sized topic hierarchy with the significantly larger 
corpus. 

There are several issues that must be borne in mind in comparing hLDA 
to LDA. First, in LDA the number of topics is a fixed parameter, and a 
model selection procedure is required to choose the number of topics. (A 
Bayesian nonparametric solution to this can be obtained with the hierarchical 
Dirichlet process (Teh et al., 2007).) Second, given a set of topics, LDA 
places no constraints on the usage of the topics by documents in the corpus; 
a document can place an arbitrary probability distribution on the topics. In 
hLDA, on the other hand, a document can only access the topics that lie 
along a single path in the tree. In this sense, LDA is significantly more 
flexible than hLDA. 

This flexibility of LDA implies that for large corpora we can expect LDA 
to dominate hLDA in terms of predictive performance (assuming that the 
model selection problem is resolved satisfactorily and assuming that hyper- 
parameters are set in a manner that controls overfitting). Thus, rather than 
trying to simply optimize for predictive performance within the hLDA fam- 
ily and within the LDA family, we have instead opted to first run hLDA to 
obtain a posterior distribution over the number of topics, and then to conduct 
multiple runs of LDA for a range of topic cardinalities bracketing the hLDA 
result. This provides an hLDA-centric assessment of the consequences (for 
predictive performance) of using a hierarchy versus a flat model. 

We used predictive held-out likelihood as a measure of performance. The 
procedure is to divide the corpus into $D_1$ observed documents and $D_2$ 
held-out documents, and approximate the conditional probability of the 
held-out set given the training set 

(7) $p(\mathbf{w}_1^{\text{held-out}}, \ldots, \mathbf{w}_{D_2}^{\text{held-out}} \mid \mathbf{w}_1^{\text{obs}}, \ldots, \mathbf{w}_{D_1}^{\text{obs}}, \mathcal{M})$, 

where $\mathcal{M}$ represents a model, either LDA or hLDA. We employed collapsed 
Gibbs sampling for both models and integrated out all the hyperparameters 
with priors. We used the same prior for those hyperparameters that exist in 
both models. 

To approximate this predictive quantity, we run two samplers. First, we 
collect 100 samples from the posterior distribution of latent variables given 
the observed documents, taking samples 100 iterations apart and using a 
burn-in of 2000 samples. For each of these outer samples, we collect 800 
samples of the latent variables given the held-out documents and approxi- 
mate their conditional probability given the outer sample with the harmonic 
mean (Kass and Raftery, 1995). Finally, these conditional probabilities are 
averaged to obtain an approximation to Eq. (7). 
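
The two-stage approximation just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the experiments: it assumes that the log likelihood of the held-out documents under each inner sample has already been computed, that the averaging over outer samples is done on the probability scale, and the function names (`log_harmonic_mean`, `log_predictive`) are hypothetical. All arithmetic is carried out in log space to avoid underflow.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of the likelihoods exp(l_1), ..., exp(l_S).

    The harmonic mean is S / sum_s (1 / L_s), so its logarithm is
    log S - logsumexp(-l_s).
    """
    s = len(log_likelihoods)
    return math.log(s) - log_sum_exp([-ll for ll in log_likelihoods])


def log_predictive(inner_log_likelihoods):
    """Average the per-outer-sample harmonic-mean estimates.

    `inner_log_likelihoods[i]` holds the held-out log likelihoods of the
    inner samples collected for outer sample i (an assumed input layout).
    The average is taken over probabilities, then returned on the log scale.
    """
    per_outer = [log_harmonic_mean(lls) for lls in inner_log_likelihoods]
    return log_sum_exp(per_outer) - math.log(len(per_outer))
```

With the sampling schedule above, `inner_log_likelihoods` would contain 100 lists of 800 values each.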

Figure 9 illustrates the five-fold cross-validated held-out likelihood for 
hLDA and LDA on the JACM corpus. The figure also provides a visual 
indication of the mean and variance of the posterior distribution over topic 
cardinality for hLDA; the mode is approximately a hierarchy with 140 top- 
ics. For LDA, we plot the predictive likelihood in a range of topics around 
this value. 

Figure 9. The held-out predictive log likelihood for 
hLDA compared to the same quantity for LDA as a function 
of the number of topics. The shaded blue region is centered 
at the mean number of topics in the hierarchies found 
by hLDA (and has width equal to twice the standard error). 

Figure 10. The five most probable words for each of ten 
randomly chosen topics from an LDA model fit to fifty topics. 

We see that at each fixed topic cardinality in this range of topics, hLDA 
provides significantly better predictive performance than LDA. As discussed 
above, we eventually expect LDA to dominate hLDA for large numbers of 
topics. In a large range near the hLDA mode, however, the constraint that 
documents pick topics along single paths in a hierarchy yields superior 
performance. This suggests that the hierarchy is useful not only for 
interpretation, but also for capturing predictive statistical structure. 

To give a qualitative sense of the relative degree of interpretability of the 
topics that are found using the two approaches, Figure 10 illustrates ten 
LDA topics chosen randomly from a 50-topic model. As these examples 
make clear, the LDA topics are generally less interpretable than the hLDA 
topics. In particular, function words are given high probability through- 
out. In practice, to sidestep this issue, corpora are often stripped of function 
words before fitting an LDA model. While this is a reasonable ad-hoc so- 
lution for (English) text, it is not a general solution that can be used for 
non-text corpora, such as visual scenes. Even more importantly, there is no 
notion of abstraction in the LDA topics. The notion of multiple levels of 
abstraction requires a model such as hLDA. 

In summary, if interpretability is the goal, then there are strong reasons 
to prefer hLDA to LDA. If predictive performance is the goal, then hLDA 
may well remain the preferred method if there is a constraint that a relatively 
small number of topics should be used. When there is no such constraint, 
LDA may be preferred. These comments also suggest, however, that an 
interesting direction for further research is to explore the feasibility of a 
model that combines the defining features of the LDA and hLDA models. 
As we described in Section 4, it may be desirable to consider an hLDA-like 
hierarchical model that allows each document to exhibit multiple paths 
along the tree. This might be appropriate for collections of long documents, 
such as full-text articles, which tend to be more heterogeneous than short 
abstracts. 



7. DISCUSSION 

In this paper, we have shown how the nested Chinese restaurant process 
can be used to define prior distributions on recursive data structures. We 
have also shown how this prior can be combined with a topic model to yield 
a Bayesian nonparametric methodology for analyzing document collections 
in terms of hierarchies of topics. Given a collection of documents, we use 
MCMC sampling to learn an underlying thematic structure that provides a 
useful abstract representation for data visualization and summarization. 

We emphasize that no knowledge of the topics of the collection or the 
structure of the tree is needed to infer a hierarchy from data. We have 
demonstrated our methods on collections of abstracts from three different 
scientific journals, showing that while the content of these different domains 
can vary significantly, the statistical principles behind our model make it 
possible to recover meaningful sets of topics at multiple levels of 
abstraction, organized in a tree. 

The Bayesian nonparametric framework underlying our work makes it 
possible to define probability distributions and inference procedures over 
countably infinite collections of objects. There has been other recent work 
in artificial intelligence in which probability distributions are defined on 
infinite objects via concepts from first-order logic (Milch et al., 2005; 
Pasula and Russell, 2001; Poole, 2007). While providing an expressive language, 
this approach does not necessarily yield structures that are amenable to 
efficient posterior inference. Our approach reposes instead on combinatorial 
structure — the exchangeability of the Dirichlet process as a distribution on 
partitions — and this leads directly to a posterior inference algorithm that 
can be applied effectively to large-scale learning problems. 

The hLDA model draws on two complementary insights — one from sta- 
tistics, the other from computer science. From statistics, we take the idea 
that it is possible to work with general stochastic processes as prior distribu- 
tions, thus accommodating latent structures that vary in complexity. This is 
the key idea behind Bayesian nonparametric methods. In recent years, these 
models have been extended to include spatial models (Duan et al., 2007) and 
grouped data (Teh et al., 2007), and Bayesian nonparametric methods now 
enjoy new applications in computer vision (Sudderth et al., 2005), 
bioinformatics (Xing et al., 2007), and natural language processing (Li et al., 
2007; Teh et al., 2007; Goldwater et al., 2006b,a; Johnson et al., 2007; 
Liang et al., 2007). 



From computer science, we take the idea that the representations we in- 
fer from data should be richly structured, yet admit efficient computation. 
This is a growing theme in Bayesian nonparametric research. For example, 
one line of recent research has explored stochastic processes involving 
multiple binary features rather than clusters (Griffiths and Ghahramani, 2006; 
Thibaux and Jordan, 2007; Teh et al., 2007). A parallel line of investigation 
has explored alternative posterior inference techniques for Bayesian 
nonparametric models, providing more efficient algorithms for extracting 
this latent structure. Specifically, variational methods, which replace 
sampling with optimization, have been developed for Dirichlet process 
mixtures to further increase their applicability to large-scale data analysis 
problems (Blei and Jordan, 2005; Kurihara et al., 2007). 

The hierarchical topic model that we explored in this paper is just one 
example of how this synthesis of statistics and computer science can pro- 
duce powerful new tools for the analysis of complex data. However, this 
example showcases the two major strengths of the Bayesian nonparametric 
approach. First, the use of the nested CRP means that the model does not 
start with a fixed set of topics or hypotheses about their relationship, but 
grows to fit the data at hand. Thus, we learn a topology but do not commit 
to it; the tree can grow as new documents about new topics and subtopics 
are observed. Second, despite the fact that this results in a very rich hy- 
pothesis space, containing trees of arbitrary depth and branching factor, it is 
still possible to perform approximate probabilistic inference using a simple 
algorithm. This combination of flexible, structured representations and effi- 
cient inference makes nonparametric Bayesian methods uniquely promising 
as a formal framework for learning with flexible data structures. 
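
To make the first point concrete, the way a tree grows under the nested CRP can be sketched as follows. This is an illustrative simulation of the prior only, not the posterior inference algorithm: it assumes paths truncated to a fixed depth, a single concentration parameter `gamma` shared by all nodes, and hypothetical names (`Node`, `sample_path`). At each node a document follows an existing child with probability proportional to the number of earlier documents that chose it, or creates a new child with probability proportional to `gamma`.

```python
import random


class Node:
    """A node of the tree; `count` is the number of documents whose path passes through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def sample_path(root, depth, gamma, rng):
    """Draw one root-to-leaf path of length `depth` and update the counts."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        # Existing child i has weight n_i; a brand-new child has weight gamma.
        weights = [child.count for child in node.children] + [gamma]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(node.children):
            node.children.append(Node())  # the tree grows at this level
        node = node.children[choice]
        node.count += 1
        path.append(node)
    return path


rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
print(len(root.children), "second-level nodes after 50 documents")
```

No tree shape is specified in advance; the number of nodes at each level is determined by the draws, and smaller values of `gamma` tend to produce fewer branches.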